Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems runs into virtual memory as
the first major obstacle: without efficient hardware support for address
translation, MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that limiting the associativity of the virtual-to-physical
mapping incurs no penalty and, when combined with careful data placement in
the MPU's memory, breaks the translate-then-fetch serialization, allowing
translation and data fetch to proceed independently and in parallel. We
propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages, respectively.
Comment: 15 pages, 9 figures
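The key enabler can be illustrated with a toy sketch (names and parameters are hypothetical, not DIPTA's actual organization): once the virtual-to-physical mapping is set-associative, the set index is a pure function of the virtual page number, so the memory partition holding the data and its translation entry is known before translating.

```python
NUM_SETS = 1024     # illustrative: page frames grouped into 1024 sets
ASSOCIATIVITY = 4   # each virtual page may live in one of 4 frames

def set_index(virtual_page):
    # The set is a pure function of the virtual page number, so it is
    # known before (and without) any translation.
    return virtual_page % NUM_SETS

def candidate_frames(virtual_page):
    # The only page frames this virtual page is allowed to occupy, one
    # per way; data fetch can start from these in parallel with the
    # translation that resolves which way actually holds the page.
    s = set_index(virtual_page)
    return [s + way * NUM_SETS for way in range(ASSOCIATIVITY)]
```

With full associativity the candidate set would be every frame in memory, which is why translation must complete before the fetch can begin.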
Near-Memory Address Translation
Virtual memory (VM) is a crucial abstraction in modern computer systems at any scale, from handheld devices to datacenters. VM provides programmers the illusion of an always sufficiently large and linear memory, making programming easier. Although the core components of VM have remained largely unchanged since early VM designs, the design constraints and usage patterns of VM have radically shifted from when it was invented. Today, computer systems integrate hundreds of gigabytes to a few terabytes of memory, while tightly integrated heterogeneous computing platforms (e.g., CPUs, GPUs, FPGAs) are becoming increasingly ubiquitous. As there is a clear trend towards extending the CPU's VM to all computing elements in the system for an efficient and easy-to-use programming model, the continuous demand for faster memory accesses calls for fast translations to terabytes of memory for any computing element in the system. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of today's translation caches, such as TLBs. In this thesis, we provide fundamental insights into the reason why address translation sits on the critical path of accessing memory. We observe that the traditional fully associative flexibility to map any virtual page to any page frame precludes accessing memory before translating. We study the associativity in VM across a variety of scenarios by classifying page faults using the 3C model developed for caches. Our study demonstrates that the full associativity of VM is unnecessary, and only modest associativity is required. We conclude that capacity and compulsory misses---which are unaffected by associativity---dominate, while conflict misses rapidly disappear as the associativity of VM increases. Building on the modest associativity requirements, we propose a distributed memory management unit close to where the data resides to reduce or eliminate the TLB miss penalty.
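The 3C classification of page faults can be sketched as a small simulation (illustrative only, not the thesis's methodology): a fault is compulsory on first touch, capacity if a fully associative memory of the same size would also miss, and conflict otherwise.

```python
from collections import OrderedDict

def classify_faults(trace, num_frames, associativity):
    """Classify page faults with the 3C model borrowed from caches:
    compulsory (first reference), capacity (a fully associative memory
    of the same size would also miss), conflict (everything else).
    Sketch under LRU replacement; parameters are illustrative."""
    num_sets = num_frames // associativity
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU state
    full = OrderedDict()                             # fully associative LRU
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for page in trace:
        # Reference the fully associative model first; its hit/miss
        # outcome decides capacity vs. conflict below.
        full_hit = page in full
        if full_hit:
            full.move_to_end(page)
        else:
            if len(full) >= num_frames:
                full.popitem(last=False)             # evict LRU page
            full[page] = True
        # Reference the set-associative configuration under study.
        s = sets[page % num_sets]
        if page in s:
            s.move_to_end(page)
            continue                                 # hit: no fault
        if page not in seen:
            counts["compulsory"] += 1
        elif not full_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(page)
        if len(s) >= associativity:
            s.popitem(last=False)                    # evict LRU within set
        s[page] = True
    return counts
```

Sweeping `associativity` from direct-mapped towards `num_frames` shows conflict misses vanishing while compulsory and capacity counts stay fixed, which is the thesis's central observation.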
Hardware Prefetching in Commercial Applications
Prefetching techniques attempt to bridge the difference between memory access latency and processor cycle time. This difference, known as the "memory gap" or "memory wall," spans two orders of magnitude and continues to grow. Together with thermal and power constraints, it is the main limitation on increasing the performance of today's processors. Recently proposed techniques exploit the fact that sequences of memory references repeat over time, and in the same order. The Achilles' heel of these prefetchers is their inability to predict memory accesses to addresses that have not been visited before. This opportunity cost is magnified in applications where a large majority of the program's data set is read only once. Another class of techniques builds on the observation that programs access the address space through common patterns aligned to memory regions, which are predictable through code correlation. One of the most important limitations of these prefetchers is the need for a first access to a region before prediction of the blocks to be referenced within it can begin; this cost cannot be amortized by correctly prefetched blocks beyond the region's size, which is finite. Another limitation is the triggering mechanism itself: because the prediction covers the entire region, a single incorrect prediction forfeits every opportunity to predict correctly within that region. The state of the art in data prefetching temporally correlates the accesses that trigger predictions within regions.
This final-year project (PFC) exposes the inefficiencies of the most advanced prefetchers and proposes two techniques to address them: (1) prediction by address-delta sequence, triggering predictions from the last sequence of deltas observed in the memory reference stream; and (2) prediction per access, issuing a prediction every time the processor sends a data request to the memory hierarchy. The results indicate that delta correlation can predict recurring access patterns to memory addresses never referenced before, improving on temporal prefetching, and that per-access predictions improve on region-based prefetchers by removing the opportunity cost of incorrect predictions within a region.
Pigment Content of D1-D2-Cytochrome b559 Reaction Center Preparations after Removal of CP47 Contamination: An Immunological Study
Isolated D1-D2-cytochrome b559 photosystem II reaction center preparations with pigment
stoichiometry higher than 4 chlorophylls per 2 pheophytins can be contaminated with CP47 proximal
antenna complex. Reaction centers prepared by a modification of the Nanba-Satoh procedure and
containing about 6 chlorophylls per 2 pheophytins showed immuno-cross-reactivity when probed with a
monoclonal antibody raised against the CP47 polypeptide. Furthermore, they could be fractionated
successfully by Superose-12 sieve chromatography into two different populations. The first few fractions
off the column contained a more definitive 435 nm shoulder corresponding to increased chlorophyll content,
and showed strong immuno-cross-reactivity with the CP47 antibody. The peak fractions off the column
displayed a less prominent 435 nm shoulder, and did not cross-react with the antibody. Moreover, when
a 6-chlorophyll preparation was mixed with Sepharose beads coupled to CP47 antibody, the eluted material
corresponded to a preparation of about 4 chlorophylls per 2 pheophytins and did not show any cross-reaction
with the antibody against CP47. The amount of CP47 protein in the 6-chlorophyll preparation
as quantitated using Coomassie Blue staining or from gel blots was sufficient to account for most of the
extra 2 chlorophylls. We conclude that D1-D2-cytochrome b559 preparations containing more than 4
chlorophylls per 2 pheophytins can be contaminated with small amounts of CP47-D1-D2-Cyt b559
complex and that native photosystem II reaction centers contain 4 core chlorophylls per 2 pheophytins.
Peer reviewed
Design Guidelines for High-Performance SCM Hierarchies
With emerging storage-class memory (SCM) nearing commercialization, there is
evidence that it will deliver the much-anticipated high density and access
latencies within only a few factors of DRAM. Nevertheless, the
latency-sensitive nature of memory-resident services makes seamless integration
of SCM in servers questionable. In this paper, we ask the question of how best
to introduce SCM for such servers to improve overall performance/cost over
existing DRAM-only architectures. We first show that even with the most
optimistic latency projections for SCM, the higher memory access latency
results in prohibitive performance degradation. However, we find that
deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the
performance of an SCM-mostly memory system competitive. The high degree of
spatial locality that memory-resident services exhibit not only simplifies the
DRAM cache's design as page-based, but also enables the amortization of
increased SCM access latencies and the mitigation of SCM's read/write latency
disparity.
We identify the set of memory hierarchy design parameters that plays a key
role in the performance and cost of a memory system combining an SCM technology
and a 3D stacked DRAM cache. We then introduce a methodology to drive
provisioning for each of these design parameters under a target
performance/cost goal. Finally, we use our methodology to derive concrete
results for specific SCM technologies. With PCM as a case study, we show that a
two bits/cell technology hits the performance/cost sweet spot, reducing the
memory subsystem cost by 40% while keeping performance within 3% of the best
performing DRAM-only system, whereas single-level and triple-level cell
organizations are impractical for use as memory replacements.
Comment: Published at MEMSYS'1
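The provisioning methodology can be caricatured with a back-of-the-envelope model. All numbers below are illustrative placeholders, not the paper's measurements: the point is only that a high-hit-rate DRAM cache keeps average latency near DRAM while denser (multi-bit) SCM cells pull subsystem cost down.

```python
def avg_latency(hit_rate, dram_lat_ns, scm_lat_ns):
    # Average memory access latency with a DRAM cache in front of SCM.
    return hit_rate * dram_lat_ns + (1 - hit_rate) * scm_lat_ns

def subsystem_cost(dram_gb, scm_gb, dram_cost_per_gb, scm_cost_per_gb):
    # Total memory subsystem cost: small DRAM cache plus large SCM pool.
    return dram_gb * dram_cost_per_gb + scm_gb * scm_cost_per_gb

# Example: high spatial locality gives the page-based DRAM cache a high
# hit rate, so an SCM-mostly system stays latency-competitive (~110 ns
# here) while a denser cell organization cuts cost per gigabyte.
lat = avg_latency(hit_rate=0.95, dram_lat_ns=100, scm_lat_ns=300)
cost = subsystem_cost(dram_gb=16, scm_gb=240,
                      dram_cost_per_gb=8.0, scm_cost_per_gb=2.0)
```

Sweeping cache size (which moves `hit_rate`) and bits/cell (which moves both `scm_lat_ns` and `scm_cost_per_gb`) against a performance/cost target is, in spirit, what the paper's methodology automates.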
BuMP: Bulk Memory Access Prediction and Streaming
With the end of Dennard scaling, server power has emerged as the limiting factor in the quest for more capable datacenters. Without the benefit of supply voltage scaling, it is essential to lower the energy per operation to improve server efficiency. As the industry moves to lean-core server processors, the energy bottleneck is shifting toward main memory as a chief source of server energy consumption in modern datacenters. Maximizing the energy efficiency of today's DRAM chips and interfaces requires amortizing the costly DRAM page activations over multiple row buffer accesses. This work introduces Bulk Memory Access Prediction and Streaming, or BuMP. We make the observation that a significant fraction (59-79%) of all memory accesses fall into DRAM pages with high access density, meaning that the majority of their cache blocks will be accessed within a modest time frame of the first access. Accesses to high-density DRAM pages include not only memory reads in response to load instructions, but also reads stemming from store instructions as well as memory writes upon a dirty LLC eviction. The remaining accesses go to low-density pages and virtually unpredictable reference patterns (e.g., hashed key lookups). BuMP employs a low-cost predictor to identify high-density pages and triggers bulk transfer operations upon the first read or write to the page. In doing so, BuMP enforces high row buffer locality where it is profitable, thereby reducing DRAM energy per access by 23%, and improves server throughput by 11% across a wide range of server applications
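A density-based predictor of this kind can be sketched as follows (hypothetical structure and parameters, not BuMP's actual hardware tables): each page's touched-block count is recorded when the page leaves the cache, and pages that proved dense trigger a bulk transfer on their next first touch.

```python
BLOCKS_PER_PAGE = 32      # illustrative: cache blocks per DRAM page
DENSITY_THRESHOLD = 0.5   # fraction of blocks that marks a page "dense"

class BumpPredictor:
    def __init__(self):
        self.touched = {}  # page -> set of blocks touched this residency
        self.dense = set() # pages predicted high-density

    def access(self, page, block):
        """Called on every read or write; returns True when this access
        should trigger a bulk transfer of the whole DRAM page."""
        first_touch = page not in self.touched
        self.touched.setdefault(page, set()).add(block)
        return first_touch and page in self.dense

    def evict(self, page):
        # When the page's blocks leave the LLC, learn its access density
        # to steer the prediction for its next residency.
        blocks = self.touched.pop(page, set())
        if len(blocks) / BLOCKS_PER_PAGE >= DENSITY_THRESHOLD:
            self.dense.add(page)
        else:
            self.dense.discard(page)
```

Streaming the whole page on the first access concentrates row buffer hits, which is where the DRAM energy savings come from.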
Unlocking Energy
Locks are a natural place for improving the energy efficiency of software systems. First, concurrent systems are mainstream and when their threads synchronize, they typically do it with locks. Second, locks are well-defined abstractions, hence changing the algorithm implementing them can be achieved without modifying the system. Third, some locking strategies consume more power than others, thus the strategy choice can have a real effect. Last but not least, as we show in this paper, improving the energy efficiency of locks goes hand in hand with improving their throughput. It is a win-win situation. We make our case for this throughput/energy-efficiency correlation through a series of observations obtained from an exhaustive analysis of the energy efficiency of locks on two modern processors and six software systems: Memcached, MySQL, SQLite, RocksDB, HamsterDB, and Kyoto Cabinet. We propose simple lock-based techniques for improving the energy efficiency of these systems by 33% on average, driven by higher throughput, and without modifying the systems
Meet the Walkers: Accelerating Index Traversals for In-Memory Databases
The explosive growth in digital data and its growing role in real-time decision support motivate the design of high-performance database management systems (DBMSs). Meanwhile, slowdown in supply voltage scaling has stymied improvements in core performance and ushered in an era of power-limited chips. These developments motivate the design of DBMS accelerators that (a) maximize utility by accelerating the dominant operations, and (b) provide flexibility in the choice of DBMS, data layout, and data types. We study data analytics workloads on contemporary in-memory databases and find hash index lookups to be the largest single contributor to the overall execution time. The critical path in hash index lookups consists of ALU-intensive key hashing followed by pointer chasing through a node list. Based on these observations, we introduce Widx, an on-chip accelerator for database hash index lookups, which achieves both high performance and flexibility by (1) decoupling key hashing from the list traversal, and (2) processing multiple keys in parallel on a set of programmable walker units. Widx reduces design cost and complexity through its tight integration with a conventional core, thus eliminating the need for a dedicated TLB and cache. An evaluation of Widx on a set of modern data analytics workloads (TPC-H, TPC-DS) using full-system simulation shows an average speedup of 3.1x over an aggressive OoO core on bulk hash table operations, while reducing the OoO core energy by 83%
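The decoupling can be sketched in software as follows (a conceptual model, not Widx's RTL; the hash function and batch sizes are stand-ins): one unit hashes a batch of keys, then groups of walker units chase the bucket node lists, which in hardware proceed concurrently so pointer-chasing latency hides behind the hashing of subsequent keys.

```python
def hash_unit(keys, num_buckets):
    # ALU-intensive key hashing, decoupled from the traversal stage.
    # (Knuth multiplicative hash stands in for the real hash function.)
    return [(k, (k * 2654435761) % (2**32) % num_buckets) for k in keys]

def walker(buckets, key, bucket_id):
    # One walker unit: pointer chasing through the bucket's node list.
    for stored_key, value in buckets[bucket_id]:
        if stored_key == key:
            return value
    return None

def widx_lookup(buckets, keys, num_walkers=4):
    results = {}
    hashed = hash_unit(keys, len(buckets))
    # Dispatch hashed keys to walkers in groups of num_walkers; in
    # hardware the walkers in a group traverse their lists concurrently.
    for i in range(0, len(hashed), num_walkers):
        for key, bucket_id in hashed[i:i + num_walkers]:
            results[key] = walker(buckets, key, bucket_id)
    return results
```

Because the walkers are programmable rather than hard-wired, the same structure serves different DBMSs, layouts, and key types, which is the flexibility goal the abstract states.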